A Hybrid Random Forests-Boruta Feature Selection Algorithm for Biodegradability Prediction
Authors
Abstract
A priori knowledge of biodegradability saves time and money in the research and design of new products. Environmental organizations have encouraged the use of quantitative structure-activity relationship (QSAR) models as a tool for predicting the biodegradability of chemicals. This work proposes a new algorithm for assessing the importance of the chemical descriptors used as input variables when modeling and predicting biodegradability. The algorithm yields an ensemble of feature subsets that trade off model complexity against generalization performance. It uses random forests as the classifier, coupled with the Boruta algorithm, to automatically rank and omit descriptors based on their Z-scores. The work shows how the four least relevant variables were identified and removed from the model while its generalization ability was retained. Furthermore, a hybrid feature selection method is developed that inspects weakly relevant features and omits them iteratively, in a loop, so that the classifiers' generalization is preserved. The prediction accuracy of the new model showed improvements compared to previous ...
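The ranking-and-omission loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a feature's Z-score is its mean importance over repeated runs divided by the standard deviation of that importance, and that a feature is dropped only while a validation score stays within a tolerance of the baseline. The function names, the `tolerance` parameter, the feature names, and the toy numbers are all hypothetical; in practice the importances and the score would come from a random forest rather than from fixed tables.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List, Sequence


def z_scores(importances: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Mean importance over runs divided by its standard deviation."""
    # max(..., 1e-12) guards against a zero spread across runs.
    return {
        name: mean(values) / max(stdev(values), 1e-12)
        for name, values in importances.items()
    }


def eliminate_weak_features(
    features: List[str],
    measure_importances: Callable[[List[str]], Dict[str, Sequence[float]]],
    evaluate: Callable[[List[str]], float],
    tolerance: float = 0.01,
) -> List[str]:
    """Repeatedly drop the lowest-Z-score feature while the score holds up."""
    kept = list(features)
    baseline = evaluate(kept)
    while len(kept) > 1:
        scores = z_scores(measure_importances(kept))
        weakest = min(kept, key=lambda name: scores[name])
        candidate = [name for name in kept if name != weakest]
        if evaluate(candidate) < baseline - tolerance:
            break  # dropping this feature would cost too much accuracy
        kept = candidate
    return kept


# Hypothetical per-run importances for four made-up descriptors.
TOY_IMPORTANCES = {
    "desc_1": [0.30, 0.28, 0.32],
    "desc_2": [0.25, 0.27, 0.26],
    "noise_1": [0.01, -0.02, 0.02],
    "noise_2": [0.00, 0.03, -0.01],
}


def toy_importances(features: List[str]) -> Dict[str, Sequence[float]]:
    """Stand-in for re-fitting a forest and reading its importances."""
    return {name: TOY_IMPORTANCES[name] for name in features}


def toy_accuracy(features: List[str]) -> float:
    """Stand-in for a validation score: only desc_1 and desc_2 help."""
    return 0.5 + 0.2 * ("desc_1" in features) + 0.15 * ("desc_2" in features)


selected = eliminate_weak_features(
    list(TOY_IMPORTANCES), toy_importances, toy_accuracy
)
print(selected)  # ['desc_1', 'desc_2']
```

In this toy run the two noise descriptors are removed because dropping them leaves the score unchanged, and the loop stops when removing `desc_1` would lower the score by more than the tolerance.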
Similar resources
bootfs - Bootstrapped feature selection
The usage of the package is illustrated for three classification algorithms: pamr (Prediction Analysis for Microarrays, [3], implemented in the pamr R package), rf_boruta (random forests with the Boruta algorithm for feature selection, [2], implemented in the Boruta R package) and scad (Support Vector Machines with Smoothly Clipped Absolute Deviation feature selection, [4], implementation in the ...
Evaluation of variable selection methods for random forests and omics data sets.
Machine learning methods, and in particular random forests, are promising approaches for prediction based on high-dimensional omics data sets. They provide variable importance measures to rank predictors according to their predictive power. If building a prediction model is the main goal of a study, often a minimal set of variables with good prediction performance is selected. However, if the obj...
Feature Selection with the Boruta Package
This article describes the R package Boruta, which implements a novel feature selection algorithm for finding all relevant variables. The algorithm is designed as a wrapper around a Random Forest classification algorithm. It iteratively removes the features that a statistical test shows to be less relevant than random probes. The Boruta package provides a convenient interface to the algorith...
Feature Selection and Predictive Modeling of Housing Data Using Random Forest
Predictive data analysis and modeling with machine learning techniques become challenging in the presence of too many explanatory variables, or features. Too many features are known not only to slow algorithms down but also to decrease model prediction accuracy. This study involves a housing dataset with 79 quantitative and qualitative fea...
Intelligent application for Heart disease detection using Hybrid Optimization algorithm
Prediction of heart disease is very important because heart disease is one of the causes of death around the world. Moreover, predicting heart disease at an early stage plays a major role in treatment and recovery, and it reduces the costs of diagnosis and the side effects of the disease. Machine learning algorithms are able to identify an effective pattern for diagnosis and treatment of the disease and ident...